- WORDS Version 1.2 - A fast word extractor program by Edwin Floyd. 3/28/91
-
- There are no operational changes from previous releases. This is
- a source maintenance update for TP 6.0, and for release as part
- of the SPELCHEK package. See SPELCHEK.DOC.
-
- Purpose of WORDS
- ----------------
- WORDS extracts a list of unique "words" from an input file, or
- several input files, and writes them to an output file, one per
- line. The program recognizes a number of options for:
-
- o Set operations on multiple files
-
- o Case sensitivity
-
- o High-order bit stripping
-
- o Alphabetic output sort
-
- o Defining the characters comprising a "word"
-
- How to run WORDS
- ----------------
- From the DOS command line enter:
-
- WORDS filenames [-U/-I/-C] [-A] [-L] [-H] [-W[+/-]abc..]
- [-Oname] [@name]
-
- Spaces delimit command line parameters. You may intermingle
- input text filenames and options (mark each option with a leading
- hyphen). Filenames may include wild-card characters. Some options
- (-W,-O) allow a character string or filename to follow the option
- letter. The string must follow immediately, with no intervening
- space, or the program will mistake it for an input file name. Some
- options (-A,-L,-H) allow a "+" or "-" to indicate "on" or "off". The
- sign also must follow with no intervening space; "+" is assumed if
- it is omitted. You may
- place options and filenames in an ASCII "include" file and
- specify its name with a leading "@" on the command line. An
- include file may contain references to other include files. You
- also may specify default options, filenames and include files in
- the DOS environment using "SET WORDS=...". For example:
-
- SET WORDS=-U -A+ -L+ -Owords.out -W-ABCDEFGHIJKLMNOPQRSTUVWXYZ
- SET WORDS=@defaults.wrd -O
-
- WORDS processes options left-to-right, first from the DOS
- environment, then from the command line. Where options conflict,
- the last option processed prevails. Thus, you may override "SET"
- environment options on the command line.
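-
- The processing order described above can be sketched in Python as
- follows. This is only an illustration, not the program's Pascal
- source: the names expand_args, resolve and read_include are
- hypothetical, the nesting limit is an added safeguard, and treating
- option letters as case-insensitive is an assumption. The -W option
- is ignored here for brevity.
-
```python
def expand_args(env_value, command_line, read_include, max_depth=8):
    """Flatten environment and command-line tokens, expanding @include files in place."""
    def expand(tokens, depth):
        for tok in tokens:
            if tok.startswith("@"):
                if depth >= max_depth:          # safeguard added for this sketch
                    raise ValueError("include files nested too deeply")
                # read_include is a caller-supplied function: name -> file text
                yield from expand(read_include(tok[1:]).split(), depth + 1)
            else:
                yield tok
    # Environment tokens come first, so command-line tokens can override them.
    return list(expand(env_value.split() + command_line.split(), 0))


def resolve(tokens):
    """Apply tokens left to right; the last conflicting option wins."""
    settings = {"set_op": "U", "A": False, "L": False, "H": False, "output": ""}
    files = []
    for tok in tokens:
        if not tok.startswith("-"):
            files.append(tok)                   # anything without a hyphen is a filename
            continue
        letter, rest = tok[1:2].upper(), tok[2:]
        if letter in ("U", "I", "C"):
            settings["set_op"] = letter
        elif letter in ("A", "L", "H"):
            settings[letter] = rest != "-"      # "+" is assumed when the sign is omitted
        elif letter == "O":
            settings["output"] = rest           # "" means StdOut, "-" means no output
    return settings, files
```
-
- For instance, with SET WORDS=@defaults.wrd -O in the environment and
- "letter.doc -A-" on the command line, the trailing -A- overrides any
- -A+ found in defaults.wrd, and the bare -O resets output to StdOut.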
-
- What the options mean
- ---------------------
- -U, -I or -C specifies the set operation to be performed on the
- extracted words from the input files. Only one of these options
- is active for any given WORDS run. The operations are:
-
- -U Union: Keep all unique words from any input file.
- This is the default.
-
- -I Intersection: Keep unique words common to all input
- files.
-
- -C Complement: Keep unique words from the second and
- subsequent files, only if they are NOT contained in
- the first file.
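-
- As a rough model of the three set operations, the Python sketch
- below combines one word list per input file and keeps the result in
- order of first encounter. The function name combine is hypothetical
- and the code only mirrors the descriptions above; it says nothing
- about how WORDS stores its words internally.
-
```python
def combine(word_lists, op="U"):
    """Combine per-file word lists (in file order) with -U, -I or -C semantics."""
    if not word_lists:
        return []
    if op == "U":        # union: every word from every file
        candidates = [w for words in word_lists for w in words]
        keep = lambda w: True
    elif op == "I":      # intersection: words present in all files
        common = set(word_lists[0]).intersection(*map(set, word_lists[1:]))
        candidates = word_lists[0]
        keep = lambda w: w in common
    elif op == "C":      # complement: words in later files that are NOT in the first
        first = set(word_lists[0])
        candidates = [w for words in word_lists[1:] for w in words]
        keep = lambda w: w not in first
    else:
        raise ValueError("op must be 'U', 'I' or 'C'")

    seen, result = set(), []
    for w in candidates:             # preserve order of first encounter
        if keep(w) and w not in seen:
            seen.add(w)
            result.append(w)
    return result
```
-
- Sorting the returned list, e.g. sorted(combine(lists, "U")), would
- correspond to the alphabetic output sort.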
-
- Other options:
-
- -A[+/-] Sort output words alphabetically (default off). If
- -A is off, output words will be in order of first
- encounter in the input files.
-
- -H[+/-] Clear the high-order bit on each input character
- (default off). Use this option to process files
- created by word processing programs, like WordStar,
- that mark some letters by setting the high-order
- bit, often at the beginning or end of a word.
-
- -L[+/-] Lower case is significant (default off). If -L is
- off, the program will shift all output words to
- upper case.
-
- -W-abc.. Replace the "word character set" with the indicated
- characters. The program checks each character in
- each input file for membership in the word character
- set and defines a "word" as an uninterrupted
- sequence of at least one but no more than 35
- characters which are members of that set. The
- default is the set of upper and lower case
- alphabetic characters.
-
- -W+abc.. Add additional characters to the word character set.
-
- -O[name] Name the output file. If the name is omitted ("-O "),
- output goes to "StdOut" and is available for a DOS
- pipe (|) or redirection (>). StdOut is the
- default.
-
- -O- Suppress output. -Onul also suppresses output. The
- program will still display word counts on the
- screen.
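-
- The effect of -H, -L and -W on word extraction can be illustrated
- with a short Python sketch. The names word_chars and extract_words
- are hypothetical. Two points are assumptions, since this document
- does not specify them: runs longer than 35 characters are simply
- skipped, and uniqueness is judged after the shift to upper case.
-
```python
import string

DEFAULT_WORD_CHARS = frozenset(string.ascii_letters)  # upper and lower case alphabetics
MAX_WORD_LEN = 35


def word_chars(w_values, base=DEFAULT_WORD_CHARS):
    """Apply -W option values such as '+_0123456789' or '-ABC' in order."""
    chars = set(base)
    for value in w_values:
        if value.startswith("+"):
            chars |= set(value[1:])      # -W+ adds characters
        elif value.startswith("-"):
            chars = set(value[1:])       # -W- replaces the whole set
    return chars


def extract_words(data, chars, strip_high_bit=False, case_sensitive=False):
    """Return unique words from a bytes object, in order of first encounter."""
    words, seen = [], set()

    def emit(run):
        # ASSUMPTION: runs longer than MAX_WORD_LEN are dropped, not split.
        if not 1 <= len(run) <= MAX_WORD_LEN:
            return
        word = run if case_sensitive else run.upper()   # -L off: shift to upper case
        if word not in seen:
            seen.add(word)
            words.append(word)

    run = ""
    for byte in data:
        if strip_high_bit:
            byte &= 0x7F                 # -H: clear the high-order bit
        ch = chr(byte)
        if ch in chars:
            run += ch
        else:
            emit(run)
            run = ""
    emit(run)                            # flush a word that ends at end of input
    return words
```
-
- With the default character set, the underline and digits end a
- word; passing word_chars(["+_0123456789"]) keeps identifiers such
- as "the_end" and "x1" whole.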
-
- Three examples
- --------------
- 1. Generate an alphabetized list of all words appearing in the
- document named WORDS.DOC and write the list to file WORDS.LST.
- The following are equivalent:
-
- WORDS words.doc -U -A -Owords.lst
-
- WORDS words.doc -A >words.lst (defaults: -U, StdOut)
-
- WORDS -U words.doc -A+ >words.lst
-
- SET WORDS=-A+ -Owords.lst (set defaults)
- WORDS words.doc
-
- 2. Given a previously extracted list of words in file SPELL.CHK,
- generate a list of words from file LETTER.DOC which are NOT in
- SPELL.CHK, and write the list to LETTER.BWD.
-
- WORDS -C spell.chk letter.doc -Oletter.bwd
-
- (A poor person's spelling checker?)
-
- 3. Given file PASCAL.PRC containing a list of Pascal library
- procedure names, determine which procedures are referenced by
- Pascal source program BIGPROG.PAS and write the referenced
- procedure names to the screen with a pause at each screen full.
-
- WORDS pascal.prc bigprog.pas -I -W+_0123456789 | more
-
- (Pascal identifiers may contain numerics or the "_" (underline)
- character, so we add these to the word set via "-W+_01..".)
-
- Limitations
- -----------
- A "word" may be no longer than 35 characters. No more than
- 65,535 unique words may accumulate. All data must fit in main
- memory (this usually determines the limit). Each unique word
- occupies its length in memory, plus an overhead of about 7 bytes,
- plus another 10 bytes at output time if output is alphabetized
- (-A+). Thus, with a typical available memory of 500k and 5
- characters (average) per word, you will run out of memory after
- about 40,000 unique words, unsorted.
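-
- The arithmetic behind that estimate can be checked with a few lines
- of Python. The function name is hypothetical, "500k" is taken here
- to mean 500 * 1024 bytes, and the per-word costs are the approximate
- figures quoted above, so the results are rough guides only.
-
```python
def max_unique_words(free_bytes, avg_word_len, sorted_output=False):
    """Estimate how many unique words fit in memory, capped at the 65,535 limit."""
    per_word = avg_word_len + 7          # word length plus about 7 bytes of overhead
    if sorted_output:
        per_word += 10                   # extra 10 bytes per word when -A+ is used
    return min(free_bytes // per_word, 65535)


unsorted_estimate = max_unique_words(500 * 1024, 5)                    # 512000 // 12
sorted_estimate = max_unique_words(500 * 1024, 5, sorted_output=True)  # 512000 // 22
```
-
- The unsorted figure comes to a little over 42,000 words, in line
- with the "about 40,000" quoted above; with -A+ the same memory holds
- roughly 23,000 words.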
-
- FYI for network users: WORDS opens its input files in "Read, Deny
- None" mode, @include files in "Read, Compatibility" mode, and the
- output file in "Write, Compatibility" mode. Only one file at a time
- is open, except during processing of nested @include files.
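-
- For reference, these mode names correspond to the standard DOS
- open-mode byte (access in bits 0-2, sharing in bits 4-6). The Python
- sketch below only computes those byte values; the helper name
- open_mode is hypothetical, and how WORDS itself sets the mode is not
- described in this document.
-
```python
# Standard DOS (3.x and later) open-mode fields for INT 21h function 3Dh.
ACCESS = {"Read": 0, "Write": 1, "Read/Write": 2}
SHARING = {"Compatibility": 0, "Deny All": 1, "Deny Write": 2,
           "Deny Read": 3, "Deny None": 4}


def open_mode(access, sharing):
    """Combine an access mode and a sharing mode into one open-mode byte."""
    return (SHARING[sharing] << 4) | ACCESS[access]


input_mode = open_mode("Read", "Deny None")          # input text files
include_mode = open_mode("Read", "Compatibility")    # @include files
output_mode = open_mode("Write", "Compatibility")    # output file
```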
-
- Legal Stuff
- -----------
- WORDS.EXE and WORDS.DOC and all Pascal source files are:
-
- Copyright (c) 1990,91 by Edwin T. Floyd,
- All rights reserved.
-
- WORDS is copyrighted "free" software. The author hereby
- expressly permits and encourages individuals to use WORDS at home
- and at work and to distribute it without charge. The author
- prohibits distribution of WORDS for profit, or as a part of a
- product sold for profit, except where explicit written permission
- has been obtained from the author for such distribution. Also,
- users groups and shareware libraries charging a disk duplication
- fee not exceeding $10.00 may distribute WORDS.
-
- The author makes no warranties of any kind, either expressed or
- implied, as to merchantability or fitness for any particular
- purpose. WORDS.EXE and WORDS.DOC are available as is and in no
- event will the author be held liable for damages, including any
- lost profits or incidental or consequential damages, even if the
- author has been advised of the possibility of such damages.
-
- Authorship
- ----------
- WORDS was written in Turbo Pascal v6.0 by:
-
- Edwin T. Floyd [76067,747] (CompuServe)
- #9 Adams Park Court 404/576-3305 (work)
- Columbus, GA 31909 404/322-0076 (home)
-
- The latest version of WORDS is available on CompuServe in the
- IBMPRO forum, and on a number of bulletin boards around the
- country.
- - Edwin - 3-28-91
-
- Update History
- --------------
- 03-30-90 ETF V1.0 Initial public release
- 11-12-90 ETF V1.1 Update for TP6.0 (not released)
- 03-28-91 ETF V1.2 Token routines changed, public release with SPELCHEK
-